Analysis and visualization of WineEnthusiast wine reviews¶

Author: Manuele Nolli, student BSc Computer Science SUPSI

Date: 28.11.2022

Mail: manuele.nolli@student.supsi.ch

Introduction¶

This document is an analysis of a public dataset found on Kaggle.com.

The dataset contains about 80k wine reviews with variety, location, winery, price, points, taster name and description.

My analysis will focus on the following questions:

  • Where are the wines produced?
  • What is the distribution of the points?
  • What is the distribution of the prices, and is it related to the points?
  • What is the distribution of the variety of wines?
  • How many tasters are there, and how many reviews has each of them written?
    • Are some tasters more reliable than others?
    • Do the tasters have a preference for a specific continent or country?
  • What are the most common words in the description of the wines?

Notebook setup¶


Dataset details¶

The following code prints the details of the dataset: how it is structured and which types each column contains.

---Dataset Info---
Total columns: 15
Columns names: country, description, designation, points, price, province, region_1, region_2, taster_name, taster_photo, taster_twitter_handle, title, variety, vintage, winery.
Columns type:
Column                 Types                             NaN Count
country                [<class 'str'>, <class 'float'>]  5
description            [<class 'str'>]                   0
designation            [<class 'str'>, <class 'float'>]  21319
points                 [<class 'int'>]                   0
price                  [<class 'float'>]                 4647
province               [<class 'str'>, <class 'float'>]  5
region_1               [<class 'float'>, <class 'str'>]  12913
region_2               [<class 'float'>, <class 'str'>]  49894
taster_name            [<class 'str'>, <class 'float'>]  150
taster_photo           [<class 'str'>, <class 'float'>]  150
taster_twitter_handle  [<class 'str'>, <class 'float'>]  1076
title                  [<class 'str'>]                   0
variety                [<class 'str'>]                   0
vintage                [<class 'str'>]                   0
winery                 [<class 'str'>]                   0
Dataframe rows: 81115
Dataset samples:
country description designation points price province region_1 region_2 taster_name taster_photo taster_twitter_handle title variety vintage winery
44388 England Red currant notes are framed by zesty lemon an... Traditional Method Rosé 93 89.0 England NaN NaN Anne Krebiehl MW https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @AnneInVino Gusbourne Estate 2015 Traditional Method Rosé ... Sparkling Blend 2015 Gusbourne Estate
45489 New Zealand Peach, pineapple, guava, tomato leaf and dried... NaN 88 13.0 Marlborough NaN NaN Christina Pickard https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @ckpickard Waxing Moon 2018 Sauvignon Blanc (Marlborough) Sauvignon Blanc 2018 Waxing Moon
79022 US This is a high-strung wine tart in acidity and... Rosé of 85 20.0 California Sonoma County Sonoma Virginie Boone https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @vboone Martin Ray 2018 Rosé of Pinot Noir (Sonoma Cou... Pinot Noir 2018 Martin Ray
11447 US Dried herb aromas are at the fore of this Cabe... NaN 91 26.0 Washington Columbia Valley (WA) Columbia Valley Sean P. Sullivan https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @wawinereport Novelty Hill 2014 Cabernet Sauvignon (Columbia... Cabernet Sauvignon 2014 Novelty Hill
34141 France Light gold in color, this opens with an assert... Terres 89 NaN Languedoc-Roussillon Pays d'Oc NaN Lauren Buzzeo https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @laurbuzz Domaine de la Baume 2016 Terres Viognier (Pays... Viognier 2016 Domaine de la Baume
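
A summary like the one above can be produced in a few lines. The sketch below assumes pandas and uses a tiny made-up table in place of the real file; the helper name summarize_columns is hypothetical and not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the reviews table (made-up rows, not real data).
df = pd.DataFrame({
    "country": ["US", float("nan"), "France"],
    "points": [88, 91, 85],
    "price": [20.0, float("nan"), 13.0],
})

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return, per column, the Python types found and the number of missing values."""
    rows = {}
    for name in frame.columns:
        # Missing values in a text column show up as float NaN, hence mixed types.
        types = sorted({type(value).__name__ for value in frame[name]})
        rows[name] = {
            "Types": ", ".join(types),
            "NaN Count": int(frame[name].isna().sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

summary = summarize_columns(df)
print(f"Total columns: {df.shape[1]}")
print(f"Dataframe rows: {len(df)}")
print(summary)
```

Listing the Python types per column, rather than only the pandas dtype, makes it visible that missing text values are stored as float NaN.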

The output shows that the dataset contains 81,115 rows (roughly 80k) and 15 columns. The columns are:

  • country: the country of origin of wine
  • description: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
  • designation: the vineyard within the winery where the grapes that made the wine are from
  • points: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
  • price: the cost for a bottle of the wine
  • province: the province or state that the wine is from
  • region_1: the wine growing area in a province or state (e.g. Napa)
  • region_2: a more specific region within a wine growing area (e.g. Rutherford inside the Napa Valley); this value can be blank
  • taster_name: name of the person who tasted and reviewed the wine
  • taster_photo: url of the taster's photo
  • taster_twitter_handle: Twitter handle for the person who tasted and reviewed the wine
  • title: the title of the wine review
  • variety: the type of grapes used to make the wine (e.g. Pinot Noir)
  • vintage: the vintage of the wine
  • winery: the winery that made the wine

Start Analysis¶

Distribution of wines across continents¶

This section shows the distribution of the wines across the continents. The dataset only provides the country column, so I created a new column called continent that contains the continent of each country.
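
One way to derive the continent column is a lookup table from country to continent. The sketch below assumes pandas; the COUNTRY_TO_CONTINENT dictionary is a hypothetical, deliberately incomplete example, and the rows are made up.

```python
import pandas as pd

# Hypothetical and incomplete lookup; a real run needs every country in the data.
COUNTRY_TO_CONTINENT = {
    "US": "North America",
    "France": "Europe",
    "Italy": "Europe",
    "Portugal": "Europe",
    "England": "Europe",
    "New Zealand": "Oceania",
}

# Made-up rows; the last one mimics a review with a missing country.
df = pd.DataFrame({"country": ["US", "France", "Italy", "New Zealand", None]})

# Countries absent from the lookup (or missing altogether) become "Unknown".
df["continent"] = df["country"].map(COUNTRY_TO_CONTINENT).fillna("Unknown")

# Share of reviews per continent, i.e. the numbers a pie chart would display.
shares = df["continent"].value_counts(normalize=True)
print(shares)
```

Mapping unknown countries to an explicit "Unknown" label keeps the five rows with a missing country visible instead of silently dropping them.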

The following code shows the distribution of the wines across the continents through a pie chart. Most of the wines are produced in Europe, followed by North America.

The chart above is an alternative view of the same distribution. It is more interactive and shows the exact number of wines produced in each continent, country and region.

Points distribution¶

Another interesting aspect of the dataset is the distribution of the points. The points are given by the tasters on a scale from 80 to 100. WineEnthusiast also groups the scores into six categories:

  • 80–82: ACCEPTABLE Can be employed
  • 83–86: GOOD Suitable for everyday consumption; often good value
  • 87–89: VERY GOOD Often good value; well recommended
  • 90–93: EXCELLENT Highly recommended
  • 94–97: SUPERB A great achievement
  • 98–100: CLASSIC The pinnacle of quality

In the following section I create a new column called pointsDescription that contains the category of each score.
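
A minimal sketch of how such a column can be built, assuming pandas and its pd.cut binning function. The scores are made up, and the label spelling follows the categories listed above.

```python
import pandas as pd

# Made-up scores that hit every category boundary.
df = pd.DataFrame({"points": [80, 82, 83, 89, 90, 94, 100]})

# Right-inclusive bins: (79, 82], (82, 86], (86, 89], (89, 93], (93, 97], (97, 100].
edges = [79, 82, 86, 89, 93, 97, 100]
labels = ["Acceptable", "Good", "Very good", "Excellent", "Superb", "Classic"]

df["pointsDescription"] = pd.cut(df["points"], bins=edges, labels=labels)
print(df)
```

Because the labels are passed in ascending order, the resulting categorical column keeps the quality order, which helps when sorting the bars of a chart.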

This graph shows that most wines fall in the Good category, followed by the Very good category: the middle scores are the most common.

Curiously, there are more wines with 90 points than with 89 points. A plausible explanation is that tasters prefer to give a borderline wine 90 points rather than 89 so that it is labeled as Excellent.

Vintage distribution¶

In this section it is possible to see the distribution of the vintage of the wines. The vintage is the year in which the grapes were harvested.

Keep in mind that the dataset contains wines reviewed between 2017 and 2020, so it is normal that most of the wines are from recent years. However, there are also some very old wines in the dataset. The oldest wine is from 1931 and, surprisingly, it does not have a very high score.

country description designation points price province region_1 region_2 taster_name taster_photo taster_twitter_handle title variety vintage winery continent pointsDescription
2722 Portugal This remarkable wine looks old, and with its d... Tinto 89 550.0 Colares NaN NaN Roger Voss https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @vossroger Adega Viuva Gomes 1931 Tinto Red (Colares) Ramisco 1931 Adega Viuva Gomes Europe Very good
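
A row like the one above can be located by converting the vintage column to numbers first, since it is stored as text. This is a sketch assuming pandas; the three rows are made up, and the non-numeric "NV" entry is an assumption used to show why errors="coerce" is useful.

```python
import pandas as pd

# Made-up rows; vintage is text, as in the dataset.
df = pd.DataFrame({
    "title": ["Wine A 2016", "Wine B 1931", "Wine C NV"],
    "vintage": ["2016", "1931", "NV"],
    "points": [91, 89, 87],
})

# Entries that are not numbers become NaN instead of raising an error.
years = pd.to_numeric(df["vintage"], errors="coerce")

# idxmin skips NaN and returns the index label of the smallest year.
oldest = df.loc[years.idxmin()]
print(oldest["title"], int(years.min()), oldest["points"])
```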

Wine variety¶

This section shows the distribution of the wine varieties. The variety is the type of grapes used to make the wine (e.g. Pinot Noir). The dataset contains many different varieties, but I decided to show only the top 10. This setting can be changed through the wineCountToShow variable.

First, I created different versions of the dataset that will be used to create the graphs.
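
The preparation step can be sketched as follows, assuming pandas. Only wineCountToShow comes from the text; the varietyGroup column name and the rows are made up for illustration.

```python
import pandas as pd

wineCountToShow = 2  # 10 in the analysis; 2 keeps this toy example readable

# Made-up rows.
df = pd.DataFrame({
    "variety": ["Pinot Noir", "Pinot Noir", "Pinot Noir",
                "Chardonnay", "Chardonnay", "Ramisco", "Viognier"],
    "points": [90, 88, 92, 87, 89, 89, 89],
})

# Names of the most reviewed varieties.
top = df["variety"].value_counts().nlargest(wineCountToShow).index

# Keep the top varieties and fold everything else into a single "Other" bucket.
df["varietyGroup"] = df["variety"].where(df["variety"].isin(top), "Other")

counts = df["varietyGroup"].value_counts()
mean_points = df.groupby("varietyGroup")["points"].mean()
print(counts)
print(mean_points)
```

Folding the long tail into one bucket makes it possible to compare the top varieties with everything else in the same chart.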

Now it is time to create the graphs. The left graph is a bar chart of the number of wines per variety, the center graph is a bar chart of the average points per variety, and the right graph is a box plot of the price distribution per variety.

Interestingly, the other varieties taken together have many more reviews than the top 10 varieties. This suggests that the dataframe is well balanced across varieties.

Wine - Price connection¶

There are two main graphs in this section. The first is a box plot representing the distribution of the prices by points. The second is a percentage histogram of the prices grouped by a personal price description:

  • up to 10 USD: Low
  • 11–40 USD: Medium
  • 41–100 USD: Expensive
  • over 100 USD: Luxury
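
The price description above can be sketched with pd.cut, assuming pandas. The column name priceDescription is hypothetical, the prices are made up, and treating each upper bound as inclusive is an assumption about how the boundaries are handled.

```python
import math
import pandas as pd

# Made-up prices, including one missing value as in the dataset.
df = pd.DataFrame({"price": [8.0, 10.0, 11.0, 40.0, 100.0, 550.0, None]})

# Right-inclusive bins: (0, 10], (10, 40], (40, 100], (100, inf).
edges = [0, 10, 40, 100, math.inf]
labels = ["Low", "Medium", "Expensive", "Luxury"]

df["priceDescription"] = pd.cut(df["price"], bins=edges, labels=labels)

# Percentage of wines per price category; rows without a price are left out.
percent = df["priceDescription"].value_counts(normalize=True).mul(100).round(1)
print(percent)
```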

The box plot shows that, as expected, the wines with the highest points tend to be the most expensive, so there is a strong connection between price and points. The following histogram confirms this: the share of expensive wines grows with the points.

Curiously, there are some wines with a very high price and very low points and, on the other side, some wines with a very low price and very high points. This means that price is not the only factor that influences the points.
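
The strength of this connection can also be put into numbers. The sketch below, assuming pandas and made-up rows, computes the median price per score (the central line of each box in a box plot) and a Spearman rank correlation, obtained here as the Pearson correlation of the ranks. This is an illustrative check and not necessarily a computation performed in the analysis.

```python
import pandas as pd

# Made-up rows, including one wine without a price.
df = pd.DataFrame({
    "points": [84, 84, 88, 88, 92, 92, 96],
    "price": [10.0, 14.0, 18.0, None, 45.0, 60.0, 150.0],
})

# Median price per score; missing prices are skipped.
median_price = df.groupby("points")["price"].median()

# Rank correlation on the rows where both values are present.
pairs = df[["points", "price"]].dropna()
rho = pairs["points"].rank().corr(pairs["price"].rank())

print(median_price)
print(round(rho, 3))
```

A rank correlation is used because prices are heavily skewed; a few very expensive bottles would dominate a correlation computed on the raw values.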

Note: I tried to create a single graph object with the previous two graphs sharing the x-axis, but this is currently not possible with plotly. Further information: https://community.plotly.com/t/how-to-set-barmode-for-individual-subplots/47931

Reviewer distribution¶

Now it is time to look at the reviewers. I am interested in how many reviewers there are and how many reviews each of them has written. I also want to see whether some reviewers are more reliable than others and whether some reviewers are more likely to review wines from a specific continent.

There are different considerations to make:

  • There are 19 reviewers in total and some of them have written a huge number of reviews. For example, Roger Voss has more than 17k reviews, which is more than 15 reviews per day for 3 years.
  • The graph in the center shows the distribution of the points awarded by the reviewers. The reviewers are very consistent in the points they give to the wines.
  • The graph on the right shows the preference of the reviewers for a specific continent. The reviewers are more likely to review wines from their own continent (for example, Roger Voss and Kerin O'Keefe live in Europe, while Virginie Boone and Matt Kettmann live in North America).
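
The numbers behind these three observations can be sketched as follows, assuming pandas. The taster names and rows are placeholders, and the continent column is assumed to exist already.

```python
import pandas as pd

# Placeholder tasters and made-up rows.
df = pd.DataFrame({
    "taster_name": ["Taster A", "Taster A", "Taster A", "Taster B", "Taster B"],
    "points": [88, 90, 92, 85, 91],
    "continent": ["Europe", "Europe", "North America", "North America", "North America"],
})

# Number of reviews plus mean and spread of the awarded points, per reviewer.
stats = df.groupby("taster_name")["points"].agg(["count", "mean", "std"])

# Row-normalised table: the share of each reviewer's reviews per continent.
continent_share = pd.crosstab(df["taster_name"], df["continent"], normalize="index")

print(stats.sort_values("count", ascending=False))
print(continent_share.round(2))

# Review rate quoted in the first bullet: 17,000 reviews over three years.
print(round(17_000 / (3 * 365), 1))  # 15.5
```

Normalising each row of the cross table makes reviewers with very different review counts comparable.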

Most used words in wine description for points¶

In this section I show the most used words in the wine descriptions for each points category. The words are extracted from the description column after a cleaning process.
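
A minimal sketch of such a cleaning and counting step, using only the Python standard library. The stop-word list, the minimum word length and the descriptions are all made up for illustration; the actual cleaning process may differ.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "this", "with"}

def top_words(descriptions, n=3):
    """Lowercase, keep letter-only tokens, drop stop words, return the n most common."""
    counter = Counter()
    for text in descriptions:
        # [^\W\d_]+ matches runs of letters, including accented ones such as "rosé".
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        counter.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counter.most_common(n)

# Made-up descriptions grouped by points category.
by_category = {
    "Good": ["Tart cherry and herb notes.", "Simple cherry fruit with tart acidity."],
    "Excellent": ["Ripe plum, spice and firm tannins.", "Plum and spice frame a long finish."],
}

for category, texts in by_category.items():
    print(category, top_words(texts, n=2))
```

The resulting word frequencies per category are the kind of input a word cloud or bar chart needs.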

[Figure: most used words in the wine descriptions, one panel per points category: Classic, Acceptable, Excellent, Good, Superb, Very good]